Once upon a time we looked at classifying PE and Mach-O files. This time it's flipped on its head. Is it possible to use various clustering algorithms to group similar files together? But, why stop there!? Can we crank up the awesome and use information from those clusters to generate Yara signatures to find files that are similar in nature?
In this notebook we'll explore not only gathering static information from PE files, but also clustering on those attributes, and finally we'll show off the Yara signature generation capabilities.
In [28]:
# All the imports and some basic level setting with various versions
import IPython
import re
import os
import json
import time
import string
import pandas
import pickle
import struct
import socket
import collections
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pefile
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
print "IPython version: %s" %IPython.__version__
print "pandas version: %s" %pd.__version__
print "numpy version: %s" %np.__version__
%matplotlib inline
In [3]:
def get_lang_value(lang):
    for key, value in pefile.LANG.iteritems():
        if value == lang:
            return key
    return 0
In [4]:
# Grab from the json data what we want
def extract_features(filename, data):
    feature = {}
    feature['filename'] = filename[26:-8]
    feature.update(data['verbose']['pefile']['file header'])
    feature.update(data['verbose']['pefile']['optional header'])
    feature['image base'] = float(feature['image base'])
    feature['size of stack reserve'] = float(feature['size of stack reserve'])
    feature['size of stack commit'] = float(feature['size of stack commit'])
    feature['size of heap reserve'] = float(feature['size of heap reserve'])
    feature['size of heap commit'] = float(feature['size of heap commit'])
    if 'size of image base var' in feature:
        del feature['size of image base var']
    if 'data directories' in data['verbose']['pefile']:
        for k,v in data['verbose']['pefile']['data directories'].iteritems():
            feature['data dir ' + k + ' rva'] = v['rva']
            feature['data dir ' + k + ' size'] = v['size']
    '''
    if 'sections' in data['verbose']['pefile']:
        for idx, sec in enumerate(data['verbose']['pefile']['sections']):
            feature['section ' + str(idx) + ' virtual address'] = sec['virtual address']
            feature['section ' + str(idx) + ' virtual size'] = sec['virtual size']
            if idx == 2:
                break
    '''
    if 'resources' in data['verbose']['pefile']:
        feature['number of resources'] = len(data['verbose']['pefile']['resources'])
        for index, resource in enumerate(data['verbose']['pefile']['resources']):
            feature['resource ' + str(index) + ' lang'] = get_lang_value(resource['lang'])
            feature['resource ' + str(index) + ' size'] = resource['size']
            feature['resource ' + str(index) + ' rva'] = resource['rva']
            if index == 2:
                break
    return feature
In [5]:
def extract_vtdata(filename, data):
    vt = {}
    vt['filename'] = filename[26:-7]
    if 'scans' in data:
        if data['positives'] > 0:
            vt['label'] = 'malicious'
        else:
            vt['label'] = 'nonmalicious'
        vt['positives'] = data['positives']
        if 'Symantec' in data['scans']:
            vt['symantec'] = data['scans']['Symantec']['result']
        if 'Sophos' in data['scans']:
            vt['sophos'] = data['scans']['Sophos']['result']
        if 'F-Prot' in data['scans']:
            vt['f-prot'] = data['scans']['F-Prot']['result']
        if 'Kaspersky' in data['scans']:
            vt['kaspersky'] = data['scans']['Kaspersky']['result']
        if 'McAfee' in data['scans']:
            vt['mcafee'] = data['scans']['McAfee']['result']
        if 'Malwarebytes' in data['scans']:
            vt['malwarebytes'] = data['scans']['Malwarebytes']['result']
    else:
        vt['label'] = 'nonmalicious'
        vt['positives'] = 0
    return vt
In [6]:
def load_files(file_list):
    import json
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_features(filename, json.loads(f.read()))
        features_list.append(features)
    return features_list
import glob
file_list = glob.glob('pefile_clustering_bsidelv/*.results')
features = load_files(file_list)
print "Files:", len(file_list)
In [7]:
def load_vt_data(file_list):
    import json
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_vtdata(filename, json.loads(f.read()))
        features_list.append(features)
    return features_list
import glob
file_list = glob.glob('pefile_clustering_bsidelv/*.vtdata')
vt_data = load_vt_data(file_list)
In [8]:
df = pd.DataFrame.from_records(features)
for col in df.columns:
    if col.startswith('resource'):
        df[col].fillna(-1, inplace=True)
df.fillna(-1, inplace=True)
df.head(5)
Out[8]:
In [9]:
df_vt = pd.DataFrame.from_records(vt_data)
df_vt.fillna('No detection', inplace=True)
df_vt.head(5)
Out[9]:
In [10]:
cols = [x for x in df.columns.tolist() if x != 'filename']
In [11]:
X = df.as_matrix(cols)
from sklearn.preprocessing import scale
X = scale(X)
from sklearn.decomposition import PCA
DDD = PCA(n_components=3).fit_transform(X)
DD = PCA(n_components=2).fit_transform(X)
In [12]:
from mpl_toolkits.mplot3d import Axes3D
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], s=50)
ax.set_title("Features in 3D")
ax = fig.add_subplot(1, 2, 2)
ax.scatter(DD[:,0], DD[:,1], s=50)
ax.set_title("Features in 2D")
plt.show()
First up is DBSCAN; it enjoys long walks on the beach, non-flat geometry, and uneven cluster sizes (http://scikit-learn.org/stable/modules/clustering.html). This seemed like a good selection for several reasons. We expect uneven cluster sizes, since this sample set contains both malware and nonmalicious binaries. Because the features are built from the file structure, clustering should pick out the different tool chains (compilers, etc...) used to produce the files, and it would be surprising to see that kind of information evenly distributed across the data set. Hopefully we will even be able to cluster malware families together. Another nice feature of the scikit-learn implementation is that all samples that don't belong to a cluster are labeled with "-1". This avoids shoving files into clusters and reducing the efficiency of any generated Yara signature. However, if we're searching for more generic sigs, we can play games to get more samples into clusters or use different algorithms.
We also show the difference between clustering on the raw (non-scaled, non-reduced) data and on scaled, PCA-reduced data, and how the latter usually gives different (and better) results.
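When we do scale and reduce below, those two preprocessing steps can also be bundled into a single scikit-learn pipeline. Here's a minimal sketch (it assumes the df and cols built above and picks an arbitrary n_components=10 just for illustration); the cells that follow do the same steps by hand so the intermediate matrices are easier to inspect.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Scale then reduce in one shot; n_components=10 is an arbitrary choice for this sketch
preprocess = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=10))])
X_reduced = preprocess.fit_transform(df.as_matrix(cols))

labels = DBSCAN(min_samples=3).fit_predict(X_reduced)
print "Noise points (label -1): %d of %d" % ((labels == -1).sum(), len(labels))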
In [17]:
from sklearn.cluster import DBSCAN
X = df.as_matrix(cols)
dbscan = DBSCAN(min_samples=3)
dbscan.fit(X)
labels1 = dbscan.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
dbscan_df = df[['filename','cluster']]
print "Number of clusters: %d" % nclusters
print "Labeled samples: %s" % dbscan_df[dbscan_df['cluster'] != -1].filename.value_counts().sum()
print "Unlabeled samples: %s" % dbscan_df[dbscan_df['cluster'] == -1].filename.value_counts().sum()
We can see that without scaling and PCA just about everything is unlabeled. Let's try again using PCA. First we determine how many dimensions to reduce to, then we cluster.
In [18]:
X = df.as_matrix(cols)
X = scale(X)
pca = PCA().fit(X)
n_comp = len([x for x in pca.explained_variance_ if x > 1e0])
print "Number of components w/explained variance > 1: %s" % n_comp
In [19]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)
dbscan = DBSCAN(min_samples=3)
dbscan.fit(X)
labels1 = dbscan.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
dbscan_df = df[['filename','cluster']]
print "Number of clusters: %d" % nclusters
print "Labeled samples: %s" % dbscan_df[dbscan_df['cluster'] != -1].filename.value_counts().sum()
print "Unlabeled samples: %s" % dbscan_df[dbscan_df['cluster'] == -1].filename.value_counts().sum()
Half the files ended up unclustered, so that's a little disappointing, but still a huge improvement.
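If getting more files into clusters matters more than keeping the signatures tight, one knob to turn is DBSCAN's eps (it defaults to 0.5). A rough sketch that reuses the scaled, PCA-reduced X from the cell above:
# Sweep eps and watch the trade-off between cluster count and unlabeled samples
for eps in [0.3, 0.5, 1.0, 2.0, 5.0]:
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    n_clust = len(set(labels)) - (1 if -1 in labels else 0)
    print "eps=%.1f  clusters=%d  unlabeled=%d" % (eps, n_clust, (labels == -1).sum())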
In [20]:
dbscan_df.cluster.value_counts().head(10)
Out[20]:
Let's see these clusters in 3D and 2D now.
In [21]:
# Remove unlabeled samples for graphing to make it prettier
tempdf = df[df['cluster'] != -1].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
DDD = PCA(n_components=3).fit_transform(X)
DD = PCA(n_components=2).fit_transform(X)
figsize(12,12)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(2, 2, 1, projection='3d')
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters")
ax = fig.add_subplot(2, 2, 2, projection='3d')
ax.set_xlim(-5,5)
ax.set_ylim(-5,15)
ax.set_zlim(-5,5)
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters (zoomed in)")
ax = fig.add_subplot(2, 2, 3)
ax.scatter(DD[:,0], DD[:,1], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters")
ax = fig.add_subplot(2, 2, 4)
ax.set_xlim(-3,4)
ax.set_ylim(-5,7)
ax.scatter(DD[:,0], DD[:,1], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters (zoomed in)")
plt.show()
Let's see how well DBSCAN did. To this end, we use data from VirusTotal to help us.
In [22]:
dbscan_vt_df = pd.merge(dbscan_df, df_vt, on='filename', how='outer')
dbscan_vt_df.head()
Out[22]:
In [23]:
clusters = set()
print "Total Number of Clusters: %s\n" % (len(dbscan_vt_df['cluster'].unique().tolist()))
for name, blah in dbscan_vt_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])
In [24]:
dbscan_cluster_results = dbscan_vt_df.groupby(['cluster', 'label']).count()
dbscan_cluster_results[['filename']].head(10)
Out[24]:
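The same comparison can be collapsed into a couple of summary numbers. This is just a sketch, reusing the dbscan_vt_df frame from above; homogeneity is 1.0 when every cluster contains only one label.
# Cluster-vs-label purity for the clustered (non-noise) samples
from sklearn.metrics import homogeneity_score
labeled = dbscan_vt_df[dbscan_vt_df['cluster'] != -1]
print pd.crosstab(labeled['cluster'], labeled['label']).head(10)
print "Homogeneity: %.3f" % homogeneity_score(labeled['label'], labeled['cluster'])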
In [25]:
dbscan_vt_df[dbscan_vt_df['filename'] == 'dc2ecab3759956a2c87da411c1ecce32fe2b71d8ade00d0dadbd460de91b411c']
Out[25]:
In [26]:
cluster_dc2 = dbscan_vt_df[dbscan_vt_df['cluster'] == 29]
cluster_dc2[['f-prot', 'mcafee', 'symantec', 'sophos', 'kaspersky', 'malwarebytes']]
Out[26]:
Below you'll see a simple call-out to a yara_signature Python module. This module contains code to generate a signature based on attributes found in the file. We've chosen a cluster and a file from that cluster to base the signature on. Then the attributes that are present (not -1) across the cluster are added to the signature. Some of the packed struct values can be wildcarded in the sig, and that's the reason for the multiple lists keeping track of the file-header and optional-header attributes.
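To make the wildcarding step concrete before the full cell below: each value is packed as a little-endian DWORD, and any hex digit that doesn't agree across every sample in the cluster is replaced with '?'. A stripped-down sketch of just that idea (the mask_values helper is hypothetical, for illustration only, and not part of the yara_signature module):
import struct

# Hypothetical helper, for illustration only: wildcard hex digits that differ across a cluster
def mask_values(values):
    hexed = [struct.pack("<I", int(v)).encode("hex") for v in values]
    masked = []
    for digits in zip(*hexed):
        masked.append(digits[0] if len(set(digits)) == 1 else '?')
    return ''.join(masked)

print mask_values([0x1000, 0x1000])   # identical values -> no wildcards
print mask_values([0x1000, 0x2000])   # values differ    -> '?' where the hex digits disagree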
In [29]:
import yara_signature
import struct
name = 26
fdf = pd.DataFrame()
for f in dbscan_df[dbscan_df['cluster'] == name].filename.tolist():
    fdf = fdf.append(df[df['filename'] == f], ignore_index=True)
# Choose a signature from cluster to use as the basis of the sig w/the attributes below
filename = 'dc2ecab3759956a2c87da411c1ecce32fe2b71d8ade00d0dadbd460de91b411c'
meta = {"author" : "dorsey", "email" : "dorsey_at_clicksecurity_dot_com"}
sig = yara_signature.yara_pe_generator.YaraPEGenerator('./'+filename, samplename="Cluster_"+str(name), meta=meta)
file_header_columns = ["pointer to symbol table", "characteristics", "number of symbols", "size of optional header",
                       "machine", "compile date", "number of sections"]
optional_header_columns = ["subsystem", "major image version", "image base", "size of heap reserve",
                           "major operating system version", "section alignment", "loader flags",
                           "minor subsystem version", "major linker version", "size of stack commit",
                           "size of code", "size of image", "number of rva and sizes", "dll charactersitics",
                           "file alignment", "size of stack reserve", "minor linker version", "base of code",
                           "size uninit data", "entry point address", "size init data", "major subsystem version",
                           "magic", "checksum", "size of heap commit", "minor image version",
                           "minor operating system version", "size of headers", "base of data", "size of image base var",
                           "data dir base relocation rva", "data dir base relocation size", "data dir debug rva",
                           "data dir debug size", "data dir exception table rva", "data dir exception table size",
                           "data dir export table rva", "data dir export table size",
                           "data dir import address table rva", "data dir import address table size",
                           "data dir import table rva", "data dir import table size",
                           "data dir resource table rva", "data dir resource table size", "data dir tls table rva",
                           "data dir tls table size"]
file_header = []
optional_header = {}
for col in fdf.columns:
    if len(fdf[col].unique()) == 1:
        if fdf[col].unique()[0] != -1:
            lower = [s for s in col if s.islower()]
            if fdf[col].unique()[0] != -1 or (len(lower) == len(col)):
                if col in file_header_columns:
                    file_header.append(col)
                if col in optional_header_columns:
                    optional_header[col] = struct.pack("<I", int(fdf[col].unique()[0])).encode('hex')
    if len(fdf[col].unique()) > 1:
        if col not in optional_header_columns:
            continue
        if type(fdf[col].unique()[0]) == str or len(fdf[col].unique()) > 9:
            continue
        u = []
        z = []
        for value in fdf[col].unique():
            u.append(struct.pack("<I", value).encode("hex"))
        for d in zip(*u):
            match = True
            for idx in range(1,len(d)):
                if d[0] != d[idx]:
                    match = False
                    break
            if match:
                z.append(d[0])
            else:
                z.append('?')
        string = ''.join(z)
        if string != '????????':
            optional_header[col] = string
if len(file_header) > 0:
    sig.add_file_header(file_header)
if len(optional_header) > 0:
    sig.add_optional_header_with_values(optional_header)
print sig.get_signature()
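Once the rule is printed it can be fed straight to yara-python to scan other files. A quick sketch, assuming the yara module (yara-python) is installed and substituting your own sample directory for the hypothetical path below:
# Compile the generated rule and scan a directory of samples with it
import glob
import yara
rules = yara.compile(source=sig.get_signature())
for path in glob.glob('pefile_clustering_bsidelv/samples/*'):   # hypothetical sample directory
    if rules.match(path):
        print "match:", path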
Now that we've got one method of going from clusters to Yara signatures down, let's take a brief look at what happens to the cluster shapes/distributions with some other clustering algorithms.
Next up, KMeans. It will put every sample into a cluster, and for this algorithm the number of clusters needs to be specified up front. There are a bunch of ways to determine how many clusters to use; below we went with a simple one from Wikipedia (http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).
In [32]:
from sklearn.cluster import KMeans
X = df.as_matrix(cols)
X = scale(X)
#rule of thumb of k = sqrt(#samples/2), thanks wikipedia :)
import math
k_clusters = int(math.sqrt(len(X)/2))
kmeans = KMeans(n_clusters=k_clusters)
kmeans.fit(X)
labels1 = kmeans.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
kmeans_df = df[['filename', 'cluster']]
print "Number of clusters: %d" % nclusters
In [33]:
df.cluster.value_counts().head(10)
Out[33]:
In [34]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("K-Means Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-4,-1)
ax.set_ylim(20,35)
ax.set_zlim(-3,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("K-Means Clusters (zoomed in)")
plt.show()
In [35]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)
# same sqrt(#samples/2) rule of thumb as above, hardcoded here
k_clusters = 22
kmeans = KMeans(n_clusters=k_clusters)
kmeans.fit(X)
labels1 = kmeans.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
kmeans_df = df[['filename', 'cluster']]
print "Number of clusters: %d" % nclusters
print
print "Cluster/Sample Layout"
print df.cluster.value_counts().head(10)
print
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-3,-1)
ax.set_ylim(20,35)
ax.set_zlim(-3,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters (zoomed in)")
plt.show()
Above you can see how scaling and PCA lead to a bit more balanced layout of some of the clusters, but we've still got some outliers. Not a huge deal, just another way to slice and look at the data.
Let's see how KMeans did at clustering the files.
In [37]:
kmeans_vt_df = pd.merge(kmeans_df, df_vt, on='filename', how='outer')
kmeans_cluster_results = kmeans_vt_df.groupby(['cluster', 'label']).count()
kmeans_cluster_results[['filename']].head(10)
Out[37]:
In [38]:
clusters = set()
print "Total Number of Clusters: %s\n" % (len(kmeans_vt_df['cluster'].unique().tolist()))
for name, blah in kmeans_vt_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])
Below we're looking at MeanShift. Scikit-learn is nice enough to tell us a bit about MeanShift use cases (many clusters, uneven cluster sizes, non-flat geometry). This seems to, once again, fit our data pretty well. Maybe we can get some better/different cluster layouts here.
In [39]:
from sklearn.cluster import MeanShift, estimate_bandwidth
X = df.as_matrix(cols)
X = scale(X)
ebw = estimate_bandwidth(X)
ms1 = MeanShift(bandwidth=ebw)
ms1.fit(X)
labels1 = ms1.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
meanshift_cluster_df = df[['filename', 'cluster']]
print "Estimated Bandwidth: %s" % ebw
print "Number of clusters: %d" % nclusters
In [40]:
tempdf = df[df['cluster'] != 0].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,-2)
ax.set_ylim(10,20)
ax.set_zlim(-5,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters (zoomed in)")
plt.show()
In [41]:
df.cluster.value_counts().head(10)
Out[41]:
In [42]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)
ebw = estimate_bandwidth(X)
ms1 = MeanShift(bandwidth=ebw)
ms1.fit(X)
labels1 = ms1.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
cluster_df = df[['filename', 'cluster']]
print "Estimated Bandwidth: %s" % ebw
print "Number of clusters: %d" % nclusters
print
print "Cluster/Sample Layout"
print df.cluster.value_counts().head(10)
print
# Once again we can remove, in this case, the largest cluster for a less dense graph
tempdf = df[df['cluster'] != 0].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,-2)
ax.set_ylim(10,20)
ax.set_zlim(-5,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters (zoomed in)")
plt.show()
In [43]:
ms_vt_df = pd.merge(cluster_df, df_vt, on='filename', how='outer')
ms_cluster_results = ms_vt_df.groupby(['cluster', 'label']).count()
ms_cluster_results[['filename']].head(10)
Out[43]:
It seems we've run into a similar situation with MeanShift as with DBSCAN. Instead of samples being unlabeled, we wound up with one cluster holding the vast majority of them. Unfortunately, using PCA doesn't help very much, and most of the samples remain in that one large cluster.
Overall, it's important to see how different algorithms can impact the end result, especially when trying to transfer knowledge from one domain to another. The various clustering techniques lead to different Yara signatures, which will fire on different sets of files. When dealing with large amounts of malware, this is one way to group existing samples and detect new potential variants of the same family.
Good luck and happy hunting!